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The Design and Implementation of Cost-effective algorithms for 
direct solution of banded linear systems on the Vector 
Processor System_32 supercomputer, located at NASA Langley 

AUGUSTINE S. SAMBA 

Department of Mathematics, Hampton University 

ABSTRACT 

The problem of solving banded linear systems by direct (non-iterative) 

techniques on the Vector Processor System (VPS ) 32 supercomputer is considered. 

This report describes two very efficient direct methods for solving banded 
linear systems on the VPS_32. 

The Vector Cyclic Reduction (VCR) algorithm is discussed in detail. A 
Performance of the VCR on a three parameter model problem is also illustrated. 
The VCR is an adaptation of the conventional point cyclic reduction algorithm. 

The second direct method is the "Customized Reduction of Augmented 
Triangles (CRAT)". CRAT is a new method created and custom designed for the 
VPSJ32 by this principal investigator. CRAT has the dominant characteristics 
of an efficient VPS_32 algorithm. CRAT is tailored to the pipeline 
architecture of the VPS_32 and as a consequence the algorithm is implicitly 
vectorizable. 

Contained in this report is a proposal for an additional one year funding 
in order to undertake further investigations of banded techniques on the Vector 
Processor System_32 supercomputer. 



1.0 INTRODUCTION 


1 .1 An overview of the Vector Processor System (VPS) 32 supercomputer 

The VPS_32 is a high speed vector computer, with the following special 
characteristics: 

(i) It is a Single Instruction stream Multiple Data stream (SIMD) type of 
computer system. An SIMD system has only one stream of instructions 
in execution at anytime, but each instruction may affect many 
different data. 

( i i ) It is a pipeline computer. The basic idea behind pipeline computers 
is essentially that of an assembly line: if the same arithmetic 
operations is going to be repeated many times, throughput can be 
increased by dividing the operation into a sequence of sub-tasks and 
maintaining a flow of operand pairs in various stages of completion. 
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1 .2 Using the Vector Processor System 32 supercomputer 


Data items of the VPS_32 exist mainly in scalar and vector modes. A 
matrix may be viewed as a set of column/row or diagonal vectors. The VPS_32 
works efficiently in vector mode. The time (T) required to perform an 

n-element vector operation by the VPS 32 is given by 

T = S + n/N (1.2.1) 

where 

T represents the time in minor cycles 

S the vector startup time and 

N the number of results/cycle emerging from the pipeline. 

Equation (1.2.1) demonstrates that by keeping the startup time to a 
minimum, the overall time needed to perform a given operation on the VPS_32 
will depend crutially on the size of n, which also gives a measure of the 
number of computations. Consequently, algoritms designed for the VPS will 
generally perform efficiently if the operations involve long vectors as opposed 
to short vectors. 'Long vectors' are deliverables of efficient data 
organi zati on (management ) . 

A salient feature about data manipulation on the VPS is that operations on 
logical items are relatively faster than equivalent operations on other types 
of compatible items. In practice, efficient algorithms for the VPS_32 
supercomputer are uniquely influenced by its pipeline architecture. 

Our major interest in the VPS_32 supercomputer is as vehicle for solving 
Banded linear systems. Two very efficient methods for solving Banded linear 
systems are described in this report. 

Section (2.0) describes the Vector Cyclic Reduction (VCR) algorithm. The 
VCR is an adaptation of the Conventional Point Cyclic Reduction algorithm [1,2]. 
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The Customized Reduction of Augmented Triangles (CRAT) is a new algorithm 
created and custom designed for the VPS_32, by this principle investigator. 

The CRAT algorithm is tailored to the pipeline architecture of the VPS_32 and 
as a consequence this algorithm is implicitly vectorizable. CRAT is unique; it 
incorporates the dominant characteristics of a very efficient vector algorithm 
for the VPS_32 supercomputer. CRAT is discussed in section (4.0). 

Section (7.0) outlines a proposal for an additional one year funding in 
order to undertake further investigation of banded techniques on the VPS_32 
supercomputer. 
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2.0 "Vector Cyclic Reduction" (VCR) METHOD 


Recall the banded linear system of equations is defined as 
Ax = y 2.1 

where a. . = o.o if |i - j| > m ; l<i , j<n . m represents the 

* 5 J 

semi -bandwidth. 

The technique here is to successively decouple the banded system into a 
sequence of smaller systems. The number of steps required to complete the 
decoupling is 0(1 og^n ) . 

To illustrate this procedure let us identify the entries of the 
coefficient matrix A, in the following manner. 



where c, = 0 = b 2.2 

I n 

Consider now any three successive equations of the linear system (2.1). 
The (i-l)^, (i)^* 1 and (i+l)*'* 1 equations are respectively. 


(i -1 ) th : c x + a x + b x=y 

i - 1 i -2 i - 1 i - 1 i-l“i _ i-l 

( i ) th : cx +ax+bx L = y 

i i-1 i i i i+1 i 
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and 


(i+l)th: c >< + a _x + b >< = y 

i+1 i i+1 i+1 i+1 i+2 i+1 

We can eliminate x^_-| and x^ + -j from the (i)^ equation to yield an equation 
for 21-j > and x^ as follows. 

Normalizing the (i)** 1 equation with respect to the coefficient of _x. 
produces: 


(i)th : c^x 


♦ x ♦ b (,) x . y (1 > 


i i-1 i i i+1 


2.3 


where 


c ll > = a- 1 c 
i i i 


.( 1 ) 


a’ 1 b 
i i 


2.4 


.(1) 


a" 1 y 
i i 


.( 1 ) 


= a 


ith i / . , t \ th 


The corresponding (i-1) and (i+1) equations are respectively 


(i-l)th: c (1) x 


+ x + b^ x = y^ 


2.5 


i-1 “i-2 “i-1 i-1 “i-1 “i-1 


and 


(i+1 ) th c (1) x^ + x + b^ x^ = y^ 
i+1 i i+1 i+1 i+2 i+1 


2.6 
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where the corresponding coefficients and the Right Hand Sides (RHS) 
respectively satisfy the corresponding relations in (2.4). 


Substituting 

now, in 

(2.3) 

for 

x^ _i and 

x^ + -| from 

(i+l)th: c (2) x 

+ X 

+ 

b< 2 > 

X 

y (2) 

i i 

-2 ”1 


i 

~i+2 

i 

where 






a (2 > = 

1 -C m 

b (,) 


b (1 > 

c (1) 

i 

i 

1-1 


i 

i+T 

c t2 > = 

- (a (2) )" 

1 (c 

(1) 

c (1 » 

) 

i 

i 


i 

i-1 


b (2 > . 

- (a< 2 >r 

1 (b 

(1) 

b (1 >) 


i 

i 


i 

1+1 


and 






y = 

(a 121 )- 1 

(y (1 > - 

c (1 > 

y (1) - 

i 

i 

1 


i 

“i-1 

Substituting 

for *1-2 

and x 

-i+2 

in (2.7) 

from the i 


b (1) y ) 

i “i+1 


2.7 


v th 


.th 


(i+2) equations respectively, produces a relation in x,..^, and for 
the ( i ) ^ n equation. 

The elimination proceeds in this manner. 

In general, the k th (k>2) step is related to the ( k-1 ) th step by 


,00 = ! . c (k-l) b (k-l) 
i i i-1 


bifc-D C (k-D, 
i i+i 


.no _ 


- (a'"’)- 1 


(c 


(k-1) 


c (k-l)) 

i-1 
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and 


(k) 




(b 


(k-1 ) 


b <k - ,) ) 

i+1 


,00 . (a 00 )-l (y(k-l) _ C (k-D y(k-l) . b (k-l) y(k-U, 

‘i i i i i-1 i i+1 


where 


(k) 

The matrix W at each step, k, comprises of three diagonals. The 
distance of the outer diagonals from the principal diagonal doubles at each 
step. 

j. i_ 

Following the final n Ln (n = log2N) step, the two outer diagonals skip out 
of the coefficient matrix A^, leaving only the main diagonal. 
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2.1 The Vector Cyclic Reduction (VCR) Algorithm 


Let a i , b.j and c. , (i = 1 , 2, . . .N) represent the block matrices on the main, 
sub and super-diagonals respectively of the matrix A. Suppose the y i , 

(i = 1, 2, ... N) represent the components of the right band vector y. 

The VCR algorithm then proceeds in the following manner. 


Step 1. INITIALIZATION: 


Set c.j -*-0 

" b. «■ o 


Step 2. COMPUTATIONS: 

For J = 1 STEP 1 UNTIL (Log 2 N) do 
BEGIN 

Compute c.j «• 


a. ^c. 
i i 


1 <i <n 


a i’ b i 




a. y . 

l 


1 -° - c i b i-k ■ b i c i+k 




y. - c. y. , - b. y. 

— l ~{ — i - k i — i 


+k 


c i 


' c i c i-k 


9 



Compute b i + - b.. b..^ l<i<n 

" k <- k + k 

END 

Step 3. Now Obtain the Solution 

Compute x.j = aT 1 ^. " 
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3.0 Numerical Example 


The following banded linear system is a test problem designed to validate the 
algorithms described in this report. 

We wish to solve 

Ax = y (3.1) 

where A is the three-parameter band matrix: 

S 


A(m;n;s] 




S 

1 


The matrix A has semi -bandwidth m, dimensions nxn. 

The parameter S is defined for 1/10 ±S< 1/3. The RHS y is given by 


y, = H A ( i , j ) 
1 J=1 


Therefore (3.1) has the unique solution 


x i =1, (l<i<n) 
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3.1 SPEED-UP for the VECTOR CYCLIC REDUCTION ALGORITHM; TIMINGS 

for the VCR ROUTINES 

Table (3.1) gives a summary of the timings for the initial 
tridiagonal VCR Subroutine! PRELUDE, The PRELUDE subroutine is 
described in figure A3 of Appendix -A. 

Two very efficient subroutines VCRI and VCRII are illustrat- 
ed in figures AI and All respectively of Appendix-A. Both VCRI 
and VCRII subroutines yield faster times than the PRELUDE subrou- 
tine. This is because both. VCRI and VCRII subroutines employ 
better and more efficient da,ta mapping schemes than the PRELUDE 
subroutine. 

The ma-Jor difference between the VCRI; and VCRII subroutines 
is that VCRII incorporates Cyber -2 QO generic vector functions as 
a means of improving the vector processing power. The VCRI sub?-, 
routine, on the other hand, seeks' to generate and access sets of 
items (scalar) in contiguous core locations, 


System Size 

Bandwidth ‘ 

Computing Timings 

(n x n) 

(m) 

(second) 

64 x 64 

2 

. 000663 

1024 x 1024 

2 

. 00278 

4096 x 4096 

2 

.00313 

Table (3.1) . 

Computing timings for 
subroutine : PRELUDE 

the initial VCR. 
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4.0 The Customized Reduction of Augmented Triangles (CRAT) Method 

4.1 Motivations for CRAT 

The Vector Cyclic Reduction (VCR) algorithm has four disadvantages viewed 
purely as a technique for solving banded systems on the Vector Processor 
System_32 supercomputer: 

1) The partitioning of a band matrix in tri-diagonal form yields super 
and sub-diagonal blocks which are triangular; but the technique takes 
no advantage of this. 

2) While it may be easy to process efficiently several RHS 
simultaneously, it is not possible to re-enter the routine with 
further righthand sides without repeating the whole reduction process. 

3) The diagonal blocks are assumed non-singular throughout, and no 
pivoting between blocks is possible, although pivoting within blocks 
can be achieved. 

4) Being an adaptation of the conventional serial (point cyclic 
reduction) algorithm, vectorization is introduced explicitly by 
employing, 

(a) meticulous data organization and 

(b) selecting computational tools that exploit the data structure. 

The processing power of these tools is given very little or no 
consideration. 

I have therefore designed an altenative scheme which is custom 
made for the VPS_32. The CRAT's prespective on points 1) to 4) 
above is as follows: 

Point 1: CRAT takes full advantage of the form of the band 

matrix 
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Point 2 


Not yet studied 

Point 3: CRAT works efficiently with singular diagonal blocks 

Point 4: CRAT is designed to exploit the structure of the band 

system and the pipeline architecture of the VPS_32. 
Therefore vectorization is implicit. 

A complete description of the CRAT method now follows. 
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4.2 The CRAT Methodology 


The stages in the CRAT method are as follows: 


4.2.1 Customization of the Banded Algorithm 


The primary goal of customizing the banded algorithm is to efficiently 
exploit the pipeline feature of the Vector Processor System (VPS)_32 
supercomputer. In order to achieve this goal it is necessary to tailor the 
solution methodology for the Banded System to the architecture of the Vector 
Processor System_32 Supercomputer. The initial task involves transforming the 
Banded System to a more suitable (tractable) problem whose solution could be 
easily and cheaply obtained by utilizing the optimal vector interpretive 
schemes of the VPS_32 supercomputer. 

Recall the nxn banded linear system for the Iv order vector u is given by 


W i , 3 U j 


= b. 




1 , 2 , 


(4.1) 


where w. . = 0 iff |i-j | > m ; m represents the semi -bandwidth. 

■ »J 

We seek to transform (4.1) to an over-determined problem by means of the 
following technique. Define 

i U 

i) an (n+mr order vector v in terms of u by 


v . 
j+m 


ru, , iff 1 lj IN 

J 

. o , otherwise 


(4.2) 
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4.2.2 Problem Decomposition 

The coefficient matrix A of (4.4) is partitioned into 2n conformable 
triangular matrices A.. , , i=l, 2, ... n (= N/2m). 
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A 1 B 1 


A 2 B 2 


A = 


4.4.1 


n-1 

A B 
n n 


The matrices A., 1 £i £ n are upper triangular with dimensions 2m x 2m. 
The , l£i£n are 2mx2m lower triangular matrices. 

■f* h 

The RHS vector is partitioned into n (2mr order subvectors y ^ , 
i=l,2,...,n by the relation: 


4 = [bj , j = 1+2 (i-1 ) m, ..., 21m], 


l£j£n (4.5) 


Note the inverse relation 


b* = [y., i 2 , .... Yp] . (4.6) 

JL L 

Then if £ is partitioned into (n+1) (2m) tn order subvectors x_. , i = 1 , 2 , ..., n 
by the relation 

4 = [Vj, j = l+2(i-l)m, ..., 2 im], l£j£n , (4.7) 

with inverse relation 

— ~ ] 5 —2 5 • * * * £^4-1 ^ (4.8) 
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the over-determined problem can be expressed as a vector recurrence problem: 


A i X/j + B i *i+i = L\> (l<i<n) (4.9) 

The task of designing an efficient solution methodology to (4.9), and hence 

the Banded System, for the VPS 32 supercomputer is the subject of the following 

section. 

4.2.3 Reduction of Augmented Triangles 


Observe that the of (4.9) are upper and lower triangular matrices 

respectively. Without any loss of generality it will be assumed that n is 
integer, power 2. 

With a view to exploiting the structure of the recurrence problem and the 
architecture of the Vector Processor System_32 supercomputer, it is instructive to 
pose equation (3.9) as a set of two subsystems: 

Subsystem I : 

A i x i + B i x i+1 = y .. , l<i<& (=n/2) (4.10) 

and Subsystem II: 

A i^i + B i x i+1 = y .j , l+s,j<i<n (4.11) 

The single Instruction stream Multiple Data stream capabilities coupled with 
the "assembly-line" processing characteristics of the VPS_32 supercomputer provide 
a clinical and unique means for reducing subsystems I and II simultaneously. This 
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reduction process is easily accomplished via a culmination of dynamic partitioning 
technique; and augmentations of the coefficient triangular matrices in (4.10) and 
(4.11). The reduction process takes only 0(log2 2 ’) steps. The method in detail is 
as follows. 


Consider Subsystem I: 

Normalizing the j th , llj£*., equation with respect to A. and writing the 
normalized coefficients with zero superscript gives 


x t o ( 0 ) _ (0) 

Ai + B j ij+1 ‘ ’ 


l<j<£- 


(4.10.1) 


1. L 

The corresponding expression for the (j+1) , l<j<&, equation is 


x + B (0) x = v I0) 
ij+1 "j+1 — j+2 ij+1 


1 < j < A- 1 


(4.10.2) 


Substituting into (4.10.1) for x.. + .j from (4.10.2) gives 


x - B (0) B (0) x 

B J B J+1 -J+2 




( 0 ) 


( 0 ) ( 0 ) 

j ^3+1 


(4.10.3) 


A. L. 

for the j , (l<j<a-l), equation. 


i. L. 

The corresponding (J+2r equation is given by 


x - B (0) B (0) x - v (0) n<°> „«» 

-j+2 B j+2 j+3 — j+4 " —j+2 B j+2 — j+3 


(4.10.4) 


By substituting for Xj +2 into (4.10.3) from 4.10.4) we obtain 


+ Bf>x J+4 = y< 2 > 


(4.10.5) 
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where 


CO 

C_I. ^ — 

o 

II 

A i ,B j 


b 4 U - 

-B^B^. 0 ] * 
J J+1 


II 

CM 

CO 

R (°) R (°) R (0) 
_B j B j+1 B j+2 

„(0) 

B j+3 

II 

o 

a: 1 y. 
J -J 


II 

y<°>- B ( . 0) y<°> 
-J J ^J+l 


y (2) = 
-j 

y ( - 0) - B ( . 0) y ( .?| 
-J J -J+l 

. „(0) R (0) 
+ B j B j+1 


[y 


( 0 ) 


B (°) v (o) 1 

j+2 £j+3 3 


Observe that (4.10.5) can be expanded in terms of Xj +4 to get a 

f* h 

involving >c_. and . In general, the K cr stage is related to the 
by the following relations: 


where 


x + B (K) x k - „<K> 

-j + B j-j+2 K - 


„(K) . a (K-l ) r (K-1 ) 
B j ' ' B j B J+K 


V (K) _ (K— 1 ) „(K-1) {K-l ) 

‘ B j ij+K • 


(1<J+K<M 

The reduction process for Subsystem I advances in this manner. 


(4.10.6) 


relation 
(K-l )^ stage 

(4.10.7) 

(4.10.8) 

(4.10.9) 
The structure 
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of Subsystem I at the end of the final reduction step is given by 

Xj + B j P) x a+ i = y( p) (l<j<a) (4.10.10) 

where p = log 2 *> = log 2 n/2 = log 2 n - 1. 

Hence, p = log 2 n - 1 corresponds to the penultimate step of the CRAT algorithm. 

We now focus on Subsystem II: 


A i £i 


B i Vi = lv 


l+x,<i <n. 


Subsystem II is reduced by employing arithmetic operations identical to those 
for Subsystem I, but with opposite coefficient triangular matrices. 

±. L. 

Normalizing the j , [ 1 +A_<j<n] equation with respect to B. and writing the 

J 

normalized coefficients with zero superscripts gives 

A j 0) x. + x. +] = yj 0) , l + «,<j<n (4.11.1) 

i. L. 

The corresponding expression for the (j-1; equation is 


,( 0 ) 


.( 0 ) 


a j-i Am + Aj = -j-i 


substituting for x . in (4.11.1) from (4.11.2) gives 

J 


-A (0) A (0) x + x = v (0) - A 101 v t0) 

fl j A j-1 Aj-1 A)+l l 3 1 z j-i 


±. L. 

The corresponding (j-2) tn equation is 


-A <°> A<°> x, , + x. , - y<°> 

J-2 j-3 — j-3 -j-1 -J-2 


A (°) y (0) 

J~2 —j-3 


(4.11.2) 


(4.11.3) 


(4.11.4) 
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Substituting now for x. -| from (4.11.4) into (4.11.3) we obtain 


where 


and 


A (2) x + x = y (2) 
A j -j-3 -j+1 


aJ. 0) = BT 1 A, 

J J J 


( 1 ) _ ( 0 ) ( 0 ) 

A j J A j-1 


a (2) _ ,(0) A (0) (0) A (0) 

A j A j A j-1 j-2 A j-3 




( 0 ) 


B" 1 y. 
J -J 




( 1 ) 


y ( . 0) - A ( . 0) y<°{ 
— J J -J") 


(2) = (0)_ A (0) (0) A (0) A (0) ( _ A (0) (0) 
A j ^j-1 A j J-l —j-2 J-2 —j-3 


Equation (4.11.5) is similarly expanded in terms of Xj_ 3 to get a 


x J+ l and Xj_ 7 . 


,th 


In general, the K (K>1) reduction stage for subsystem II i 
(K-l) th stage by the following relations 


A (k) x + x v (K) 

J -j. 2 k + i -J +1 


(4.11.5) 


(4.11.6) 


) 

new relation in 
s related to the 

(4.11.7) 
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where 


(K) «(K-1 ) « (K-l ) 

J ' " A j J-K 


and 



V (K-1)_ a (K-1) 
-j J 



1 +«.<, j<n 


At the end of the final reduction step (p), 

p = log 2 ^ = log 2 n _ 1 » 

subsystem II takes the form, 

Aj P) x a+1 + x j+] = Yj p) n+l<j<n 

4.2.4 Final Step: Elimination of x ^ +1 

The log 2 n step is the final step of the algorithm. Here we 
x^l from (4.10.10) and (4.11.10) as follows. 

Combining the equations of the reduced subsystems I and II 
(4.11.10) respectively leads to 

x*. + C j P) — &+1 = Xj P) ’ 


(4.11.8) 

(4.11.9) 


(4.11.10) 

simply eliminate 
of (4.10.10) and 

(4.12) 
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where 


and 


• (p) J 


,(p) 


- a j 


(p) 


, iff 
, iff 


l<j<n/2 
n/2 + 1 <j <n 


i. 

J 



j , iff 
j+1, iff 


l<j<n/2 

n/2 + 1 n . . 


An analysis of equation (4.12) reveals the following. 
Case (I) . 

When J=1 , (4.12) becomes 


x + C (p) x = v (p) 

ill + L i 2U+i - l] 


(4.13) 


A. U 

The components of the (2m) order vector x^ are easily determined from 
relations (4.7) and (4.2), to be 


^1 = 


0 

0 


0 

u. 


m 


Notice the 
m leading zero 


(4.14) 
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Substituting x_-| from (4.14) into (4.13) gives 


B i n)(i ’j) h+] (j<) = y| P)(i ) (4.15) 

J=1 i=l 



for the leading m - equations. 
Case (II) . 

When j=n (4.12) becomes 


x , i 

-n+1 


.(p) 


-u] 




( p|d 


(4.16) 


X. L. 

The components of the (2m) order vector are similarly determined from 
relations (4.7) and (4.2) to be 


*n + l 


u 


N+l-m 


u 


N+2-m 


0 

0 


Notice the 
m trailing zeros 


(4.17) 
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Substituting for from (4.17) into (4.15) gives 
2m 2m 

A n P)(i ’j } Xft+iU) = y^ P> (i ) (4.18) 

J=m+1 i=m+l 

for the trailing m equations. 

th 

The components of the (2m; order vector x &+ -j are very easily determined 
from (4.16) and (4.18) through a standard elimination process. 

i. L 

Substituting now for in (4.12) yields the remaining (n-1) (2m; order 
vector x.. (l<.j<n and jVa+1 ) . 

J 

The unknowns u. of the Banded System are recovered easily from x., ( 1 <j <n ) 
J J 

through the inverse relations in (4.8) and (4.2): Evidently u^. = x . +m+ -| , 

0=1 ,2, ...N. 
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5.0 Implementation of the CRAT algorithm on the VPS 32 supercomputer 


The performance of the VPS_32 supercomputer, is much more sensitive to the 
coding technique. Moreover the choice of implementing an efficient algorithm may 
be different for different supercomputers. The following implementation is 
proposed for the VPS_32 supercomputer. 

5.1 Data Organization 

The VPS_32 works efficiently with long vector operands. The following 
mapping scheme provides the means for processing vector operands of minimum length 
N, the linear dimension of the Banded System. 

Let A.. , B.. , i = l,2,...n, represent the triangular block matrices described 
in (4.4.1). 

These matrices are held in the arrays C and B as follows: 


and 


C[i] 


A i , l<i<a C = n/2 ) 
B.j , a+1 <1 <n 



l<i<a C= n/2 ) 


£+Uiin 


C[i] and B[i], l<i<n, may be viewed as a set of m Nth order vectors: 

C[i] f > C ( 1 + (i-l)m, m; N) 

Given the above, the normalization is easily accomplished via. 


CCi] f C _1 [i] B[i ] 


( 1 <i <n ) 
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5.2 Reduction Stage 


Each of the reduction phases for the CRAT algorithm can be efficiently 
implemented on the VPS_32 by employing dynamic mode partitioning technique. 

In a dynamic mode partitioning, the partitions of the subarrays are allowed 
to vary over the entire parent array, so that composite subarrays behave as single 
data items during processing. 

For example, suppose C is given by: 


C[i] = 


L A 3 J 


then C may be processed as 


, 0 ) 


>( 1 ) 


where is the composite subarray, 


, 0 ) - 


L H 2J 


and A^ [A3], 


Alternatively, C may be processed as 


, 0 ) 


, 0 ) 


where a|^= [A-j] and A^Ms the composite 
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subarray. 



The dynamic mode partitioning scheme for the CRAT algorithm is illustrated in 
figures 5.0 to 5.4. 

Each of the Reduction phases can be carried out in a time proportional to m; 
the total time for the CRAT algorithm is then 
T = k m [log^n -1] + V m< 
where k is a constant; 

V m is the time required to solve a 2mx2m system of linear equations on the VPS_32 
supercomputer. 
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blocks to "triangular forms". 
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Figure 5.1 First reduction stage. 
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Figure 5.2 Second reduction stage 
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Figure 5.3 Final reduction stage. 
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Figure 5.4 Elimination stage. 


1 


0 0 0 
0 0 0 1 
0 1 
0 1 
0 1 
0 1 
0 1 
0 1 
0 1 
0 1 
0 1 
0 1 
0 1 
0 1 
0 1 
0 0 
0 0 


1 


1 



6.0 Future Developments 


We propose a one year program which includes the following tasks: 

TASK 1 

Development and implementation of the Computer Code for the "Customized 
Reduction of Augmented Triangles" Algorithm 
TASK 2 

A study of CRAT algorithm for stability properties and possible 
variations of the algorithm 
TASK 3 

Further investigation of the Vector Cyclic Reduction Algorithm and Block 
LU decomposition Algorithm 
TASK 4 

Adaptation of the CRAT algortihm for solving banded systems with 
multiple right hand sides. 
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7.0 Comments 


The CRAT algorithm, in its present form, employs no pivoting strategy within 
the triangular block matrices. However, column pivoting within the blocks is 
possible at 'little' cost. The CRAT algorithm successfully avoids the 
instabilities of the VCR algorithm for singular diagonal block matrices. CRAT is 
tailored to the VPS_32 pipeline architecture and therefore it should perform 
efficiently on the VPS_32 supercomputer. 
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APPENDIX 


Subroutine VCR- I (A, 
returns the solution 

PARAMETERS ARE: 

N 

LG2N 

LMAX 

D 


A 


Fig. A-I 

D, NDIM, LMAX, N, LG2N) 

of a tridiagonal linear system of equations. 


INTEGER; Must be set to the num- 
ber of equations in the system; 
Value unchanged upon exit. 

INTEGER; Must be set to log N; 

A 

Value unchanged upon exit, 

— INTEGER; Must be set to 3N; Value 
unchanged upon exit, 

, VECTOR; Length "LMAX" ; Must be 

set as follows 

1 < i < N 

N + 1 < i < 2N - 1 
i = 2N 

2N + 1 £ i < 3N -■ 1 
i = 3N 

L. d^ Cl <- i N) represents the dia- 
gonal entries; Initial values are 
changed upon exit. 

REAL VECTOR; Length 5N ; Must be 

set as follows 


D(i) 


d. 

l 

d. 

i 


d. 

l 


- N+l 


- 2N 


iff 

iff 

iff 

iff 

iff 
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A(i) 


1 

iff 

1 < i < N 


L. 

l - n 

iff 

N + 1 < i < 2N 


a i-2N+l 

iff 

2N + 1 < i < 3N 

- 1 

0 

iff 

i = 3N 


b i - 3N 

iff 

3N + 1 < i < 4N 

- 1 

0 

iff 

i = 4N 


L i-4N+1 

iff 

4N + 1 < i < 5N 


L • m Ci • » 

- l' X' 

and 

represent the 

right 


hand vector, sub-diagonal entries 
and the super- diagonal entries 
respectively. 

Initial Values are changed upon exit*. 
A(i) contains the solution vector x^ 
(i = 1, 2 .. N) respectively. 
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SUBROUTINE VCRI (A, D, NDIM, LMAX, N, LBEN) 


DIMENSION A (NDIM) , D(LMAX) 

LI=N+1 

LJ=LI+N 


LK=LJ+N 

LL=LK+N 

LM=N+N 

C NORMALIZE : RHS, SUB-, SUPER- DIAGS 
A (LI ;LMAX) = A (LI ; LMAX ) /D ( 1 ; LMAX ) 

C REMAP RHS: 

A (LL ; N) =A (LI + 1 ;N) 

ISTEP=LG£N-1 

K=1 

KK=£ 

NK=N— 1 

DO 40 1=1, I STEP 

C COMPUTE PRODUCTS: 

D ( 1 ; LMAX ) =A (LI ; LMAX > *A (LJ ; LMAX ) 

C UPDATE RHS & MAIM DIABONAL: 

A d ; N > = 1 . 0 

A(l; LM ) = A ( 1 ; LM ) -D ( L I ; LM ) 

A ( KK ; NK ) =A ( KK ; NK ) -D ( L I ; NK ) 

A ( L I ; NK ) = A (LI ; NK ) -D ( K ; NK ) 

C UPDATE SUB-, SUPER- DIAGONALS : 

MK=LM— K 

A ( L J ; MK ) =— A ( L J ; MK ) * A ( L J ; MK ) 

K=K+K 

KK=K+1 

NK=N-K 

C NORMALIZE RHS, SUB- SUPER-DIAGS: 

A (LI ;N) =A (LI ;N) /A(l ;N) 

A (LJ ; NK) =A (LJ ; NK) / A ( KK ; NK ) 

A (LK ; NK) =A ( LK ; NK ) /A (1 ;NK) 


40 A ( LL ; NK ) = A ( L I +K ; NK ) 

C SOLVE FOR UNKNOWN VECTOR: 

A (LK+K ;K) =A (LI ;K) 

A(1;K)=1. 0— A ( L J ; K ) 

A ( KK ; K ) = A ( 1 ; K > 

A(1 ;N)=(A(LI ;N)-A(LK;N) ) /A(l ;N) 

RETURN 

END 
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Fig... Ar7.II 


Subroutine VCR- 1 1 " (CV, A, D, NDIM, LMAX,. N, LG2N) 

returns the - solution of a tridiagonal- linear system of equations. 

PARAMETERS ARE: 


N 


LG2N 


LMAX 


D 


A 


INTEGER; Must be set to the num- 
ber of equations in the system; 
Value unchanged upon exit, 

— INTEGER; Must be set to log N; 

A 

Value unchanged upon exit, 

««- INTEGER; Must be set to 3N; Value 
unchanged upon exit, 

«« ~ VECTOR; Length "LMAX"; Must be 
set as follows 


d. 

i 

iff 

1 < i < N 

d i - N+i 

iff 

N + 1 < i < 2N 

1 

iff 

i = 2N 

d i - 2N 

iff 

2N + 1 < i £ 3N 

1 

iff 

i = 3N 


(_ d^ .Cl £ x N) represents the dia- 
gonal entries; Initial values are 
changed upon exit. 

— — REAL VECTOR; Length 5N ; Must be 
set as follows 
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cv 



- 1 

iff 

1 < i < N 



L. 

i - n 

iff 

N + 1 < i < : 

2N 


a i-2N+l 

iff 

2N + 1 <. i < 

3N 

A(i) < — 

0 

iff 

i = 3N 



b i - 3N 

iff 

3N + 1 < i < 

4N 


0 

iff 

i = 4N 



L i-4N+1 

iff 

4N + 1 < i < 

5N 


L^, a^, and b^ represent the right 
hand vector, subr-diagonal entries 
and the super- diagonal entries 
respectively . 

Initial Values are changed upon exit*. 
A(i) contains the solution vector, x^ 
(i = 1, 2 . . N) respectively. 

BIT VECTOR; Length ' LMAX ' ; Must be 
set as follows: 

CV (i) = B ' 1 * , (1 < i < LMAX) 
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SUBROUTINE VCRI I (CV, A, D, NDIM, LMAX, N, LG2N) 


DIMENSION PI (NDIM) , D(LMPlX) 

BIT CV(LMPIX) 

LI=N+1 
L J=LI+N 

LK=LJ+N 

LL=LK+N 

LM=N+N 

C NORMALIZE : RHS, SUB-, SUPER- DIPIGS 
C REMAP RHS: 

A (LL;N) =A (LI + 1 ;N> 

ISTEP=LG£N— 1 

K =1 

KK =2 

NK=N— 1 

DO 40 1=1, ISTEP 

C COMPUTE PRODUCTS: 

D ( 1 ; LMAX ) =A (LI ,- LMAX ) *A <LJ ,- LMAX ) 

C UPDATE RHS & MAIN DIAGONAL: 

A < 1 ; N) =1 . 0 

A < 1 ; LM ) = A ( 1 ; LM ) -D ( L 1 5 LM ) 

CV < 1 ; MK) =GQVMKO <NK, N ; CV ( 1 ;MK) ) 

D ( 1 5 MK ) =Q 8 VMERGE ( D ( L I ; NK ) , D ( 1 ; N) , CV ( 1 ; MK) ; D ( 1 ; MK) ) 
A ( KK ; MK ) =A < KK ; MK ) -D ( 1 ; MK ) 

C UPDATE SUB-, SUPER- DIAGONALS: 

MK=LM— K 

A (LJ ; MK) =— A ( LJ ; MK) *A <LJ ; MK) 

K=K+K 

KK=K+1 

NK=N-K 

C NORMALIZE RHS, SUB- SUPER-DIAGS: 

CV ( 1 ; MK) =Q 8 VMKQ (N, NK ; CV ( 1 ; MK) ) 

D ( 1 ; MK ) =Q 8 VMERGE ( A < 1 ; N) , A (KK ; NK) , CV (1 ;MK) ; D ( 1 ; MK) ) 
A (LI ; MK) =A (LI ; MK) /D ( 1 ;MK) 

A ( LK ; NK ) = A ( LK ; NK ) / A ( 1 ; NK ) 

40 A (LL ; NK) =A (LI+K ; NK) 

C SOLVE FOR UNKNOWN VECTOR: 

A(1 ;K)=1. O— A ( L J ; K ) 

A ( KK ; K ) = A ( 1 ; K ) 

D ( 1 ; N) =Q 8 VMERGE ( A ( LI ;K) , A (LK+K ; K) , CV (KK ; N) ;D (1 ;N) ) 
A(1;N)=(A(LI;N)-D(1;N))/A(1;N) 

A ( 1 ; N) =A (LI ; N) /A(1;N> 

RETURN 

END 
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Subroutine PRELUDE (P,R, X, D, NMAX, LG2N, N) returns the 
solution of a tridiagonal linear system of equations. 


PARAMETERS ARE: 
N 


LG2N 


NMAX 


D 


X 


-r-.- INTEGER; must be set to the 
number of equations in the 
system; Value unchanged upon 
exit . 

INTEGER; Must be set to log 2 N; 
Value unchanged upon exit. 

-r.- INTEGER; Must be set to 5N; 

Value unchanged upon exit. 

— . REAL VECTOR; Length N; Must be 
set to zero. Upon exit 
D(i) contains the solution 
x^(l £ i £ N) respectively. 

■ REAL VECTOR; Length 5N; Must 
be set as follows 


r a . 
i 

/ 

1 < i £ N 

d. 

1 

/ 

N + 1 £ i £ 2N 

b. 

i 

/ 

2N + 1 £ i < 3N 

*i 

$ 

3N + 1 £ i £ 4N 

.0 

f 

otherwise 


where y^, a^, b^ d^ represent 
the right hand vector, sub-, 
super-, and main diagonals of 


43 



p 


R 


the band matrix; Values 
changed upon exit. 

REAL VECTOR; Length N; 
Working array. 
r-.-~ REAL VECTOR; Length N; 
Working array. 
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FIG. A3 


SUBROUTINE PRELUDE <P, R, X, D, NMAX, LG2N, N) 
DIMENSION DIAG(N) , X (NMAX) , P (N) , R (N) 


I=N/2+l 

J=N*2+1 

L=N*3+I 


X (L+N/2 ; N) =0. 0 
R < 1 ; N) =X ( J+N ;N) 

X <3*N+1 ;N/2) =0. 0 
DIflG(l;N)=X<N+l;N) 

X (N+l ;N) =0. 0 
P ( 1 ; N) =X ( 1 ;N) 

X ( 1 ; N/£) =0. 0 
K=1 

DO 10 ISTEP=1, LG2N 

X (I;N)=P(J;N) /DIAG<1 ;N) 

X <L;N)=R(1;N) /DIAG ( 1 ;N> 

DIAG(1;N>=1. -X (I ;N)*X <J-K;N)-X (J;N) *X <I+K;N) 

R < 1 ; N) =X (L;N) -X (I ;N) *X (L-K;N)-X <J;N)*X (L+K;N> 
P(1 ;N)=-X(I;N)*X <I-K;N) 

X ( J ; N) =— X (J;N)*X (J+K;N) 


10 K=K+K 

DIAG ( 1 ; N) =R ( 1 ; N) /DIAG ( 1 ; N) 

RETURN 

END 
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End of Document 



